DITTO

Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset

Anonymous Authors
We highly recommend watching the video with sound on.

Dataset: Ditto-1M

A high-quality synthetic dataset for instruction-based video editing, featuring diverse scenarios and comprehensive editing instructions across global and local transformations.

Global Editing

Global editing changes the entire video: the modification affects every frame and is applied consistently across the whole temporal sequence. This covers style transfer, color grading, and atmospheric adjustments, while keeping the result visually coherent from the first frame to the last.

Local Editing

Local editing focuses on specific regions or objects within the video, applying precise modifications to targeted areas while preserving the surrounding content. This technique enables selective enhancement, object replacement, and regional adjustments that maintain the integrity of the overall composition.
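To make the distinction concrete, here is a toy sketch of the two edit types on a tiny grayscale video. It is only an illustration of the definitions above, not the pipeline used to build Ditto-1M: the `brighten` operation, the per-frame boolean masks, and all function names are assumptions chosen for the example.

```python
from typing import List

Frame = List[List[float]]   # one grayscale frame: rows of pixel intensities in [0, 1]
Video = List[Frame]         # a video: a list of frames
Mask = List[List[bool]]     # True where a local edit is allowed to change pixels


def brighten(value: float, amount: float) -> float:
    """Toy stand-in for an edit: shift one pixel's intensity, clamped to [0, 1]."""
    return min(1.0, max(0.0, value + amount))


def global_edit(video: Video, amount: float) -> Video:
    """Apply the same edit to every pixel of every frame."""
    return [[[brighten(p, amount) for p in row] for row in frame] for frame in video]


def local_edit(video: Video, masks: List[Mask], amount: float) -> Video:
    """Apply the edit only where the per-frame mask is True; copy other pixels unchanged."""
    if len(masks) != len(video):
        raise ValueError("need one mask per frame")
    return [
        [
            [brighten(p, amount) if m else p for p, m in zip(row, mask_row)]
            for row, mask_row in zip(frame, mask)
        ]
        for frame, mask in zip(video, masks)
    ]


# Two 2x2 frames; the mask selects only the top-left pixel in each frame.
video = [[[0.2, 0.4], [0.6, 0.8]], [[0.1, 0.3], [0.5, 0.7]]]
masks = [[[True, False], [False, False]]] * 2

globally_edited = global_edit(video, 0.1)        # every pixel changes
locally_edited = local_edit(video, masks, 0.1)   # only the masked pixel changes
```

In the global case every pixel in every frame is modified, whereas in the local case pixels outside the mask are returned exactly as they were in the source.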

More Dataset Samples

Model: Editto

A video editing model trained on the Ditto-1M dataset that achieves state-of-the-art results across diverse editing scenarios, outperforming existing methods.

Model Results

Showcasing the capabilities of our Editto model across various editing scenarios, from global style transfers to precise local modifications.

Qualitative Comparison

Columns: Source, TokenFlow, InsV2V, InsViE, Gen4-Aleph, Ours (Editto)

Application Examples

We showcase the synthetic-to-real (sim2real) capability enabled by our data: the model is trained to map the stylized videos in our dataset back to their original, real-world source videos.
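One way to picture this setup is as a swap of input and target in the stylization samples. The sketch below is a minimal illustration under stated assumptions: the `EditSample` record, its field names, the `is_stylization` flag, and the fixed reverse instruction are all hypothetical and are not taken from the Ditto-1M release.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EditSample:
    """Hypothetical record for one dataset triplet; field names are assumptions."""
    source_video: str      # path to the model input clip
    instruction: str       # editing instruction applied to the input
    edited_video: str      # path to the target clip
    is_stylization: bool   # True for global style edits


# Assumed fixed prompt for the reverse direction; the actual prompt is not specified here.
SIM2REAL_INSTRUCTION = "Turn this video into a realistic video."


def to_sim2real(samples: List[EditSample]) -> List[EditSample]:
    """Swap input and target for stylization samples so the mapping is stylized -> real."""
    return [
        EditSample(
            source_video=s.edited_video,   # the stylized clip becomes the input
            instruction=SIM2REAL_INSTRUCTION,
            edited_video=s.source_video,   # the real clip becomes the target
            is_stylization=False,
        )
        for s in samples
        if s.is_stylization
    ]


# Placeholder paths for illustration only.
samples = [
    EditSample("real/0001.mp4", "Make it a watercolor painting.", "styled/0001.mp4", True),
    EditSample("real/0002.mp4", "Replace the car with a bus.", "edited/0002.mp4", False),
]
reversed_pairs = to_sim2real(samples)
```

Only the stylization sample is kept and reversed; the local edit is skipped because it has no stylized version to map back from.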

Effectiveness of Denoising Enhancer

Here we demonstrate the effectiveness of the denoising enhancer: the raw edited video is shown on the left and the enhanced one on the right (zoom in to see the details).